This course gives a general overview of the possibilities of working with data. No matter the topic, all scientists need to work with lots of data: organizing and analyzing it in order to understand, explain and represent it. This course will give us the basic tools to do so, mainly working with the R programming language and the RStudio software, which are probably the most widely used open source tools for statistical computing and graphics thanks to their robustness.
The core values of the course are based on the importance of open data and knowledge sharing.
Link to GitHub repository: https://github.com/elirams/IODS-project
In this chapter we will go through how to read, interpret and represent data. Then we will learn how to apply and interpret simple and multiple regressions. Finally, we will go through the most important methods for validating regression models.
First we will read the data, called students2014, and explore its structure and dimensions:
learning2014<-read.csv("C:/Users/Elisabet/Desktop/ELI/MASTER'S DEGREE/Open Data Science/GitHub/IODS-project/IODS-project/learning2014.csv", header = TRUE)
dim(learning2014)
## [1] 166 7
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
We can see that our data has 7 variables (columns) and 166 observations. The data is about the different learning approaches of students. The variables measured are:
Gender: it has 2 levels, M for male and F for female.
Age: the age of the students. Integer data type.
Attitude: global attitude towards statistics.
Deep: stands for deep approach learning and is a combination of variables including seeking meaning, relating ideas and use of evidence. This variable measures to what extent the student intends to maximize understanding, with a true commitment to learning.
Stra: stands for strategic approach and is a combination of variables including organized studying and time management. This variable measures to what extent the students organize their studying.
Surf: stands for surface approach learning and is a combination of variables including lack of purpose, unrelated memorising and syllabus-boundness. This variable measures to what extent the student memorizes without understanding, with a serious lack of personal engagement in the learning process.
Points: the exam points that the students obtained. Integer data type. The maximum number of points a student could get in the exam is 30, plus some extra points from participating in the study, and the minimum for passing is 12.
The variables attitude, deep, stra and surf are numeric, with values ranging from 1 to 5: 1 meaning that the student is very far from that specific approach and 5 that the student is very close to it.
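We can check directly that these scores indeed stay within the 1 to 5 range (a quick sketch using the learning2014 data read above):

```r
# Observed minimum and maximum of each Likert-type score
sapply(learning2014[, c("attitude", "deep", "stra", "surf")], range)
```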
We will do a graphical overview of the data and show summaries of the variables, but first we install the packages and load them into RStudio:
library(ggplot2)
library(GGally)
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
For the graphical overview of the data we will use the ggpairs plot from the GGally package (an extension of ggplot2), because it is a more advanced plot that shows us at the same time the distribution of the data as well as the relationships between the variables. It creates scatter plots and gives the correlation values. We specify the combo argument in the lower list because we have both discrete and continuous variables. Mapping describes the aesthetic parameters, and in this case we specify to colour by gender.
p <- ggpairs(learning2014, mapping = aes(col=gender), lower=list(combo=wrap("facethist", bins=20)))
p
In the summary of our data we can observe how the variables are distributed. For gender, we have 110 females and 56 males. The average age of the respondents is 25.5, with a minimum of 17 and a maximum of 55 years old. Most of the respondents are between 21 and 27 years old. The mean attitude towards statistics is 3.1; women have a higher mean attitude than men. The mean deep approach score is 3.7 and there are no big differences between men and women. The strategic approach has a mean of 3.1, with women scoring slightly higher. The surface approach has a mean of 2.8 and there are no significant differences between men and women. The average exam score is 22.7 points; the lowest is 7 and the highest is 33.
With the ggpairs plot we can see graphically the distribution of the variables along the middle diagonal, separated for men (blue) and women (pink). We can also see in the top row box plots showing the median and the first and third quartiles for each of the variables. In the lower left we have the scatter plots and in the upper right the correlation values. We can observe that the pair with clearly the highest correlation is attitude and points, with 0.44. Exam points are also slightly related to the strategic and surface approaches, with correlations of 0.15 and 0.14, respectively.
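The exact correlations with exam points can also be computed numerically (a sketch; these are the overall correlations, while ggpairs additionally shows them per gender):

```r
# Pearson correlations between exam points and the other numeric variables
cor(learning2014[, c("attitude", "deep", "stra", "surf", "points")])["points", ]
```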
Now we choose three variables as explanatory variables for the target (dependent) variable, which is the exam points. Based on the correlation values, we choose attitude, stra and surf as explanatory variables.
First we write the model; since we have more than one explanatory variable it is a multiple regression, with a formula of the form y ~ x1 + x2 + x3.
my_model <- lm(points ~ attitude + stra + surf, data=learning2014)
Let’s look at the summary of our model:
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
In the output we see a table where the first column shows the estimate of each of the coefficients (parameters) of the model, followed by the standard errors of these estimates, and finally the t-values and the p-values used to accept or reject the null hypothesis. Based on the results, the only variable with a p-value close enough to zero to reject the null hypothesis, and therefore affirm a statistical relationship between the variables, is attitude. Therefore stra and surf are not statistically related to points.
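The coefficient table can also be accessed programmatically, for example to pull out the p-values (a sketch using my_model fitted above):

```r
# The coefficient matrix of the summary; column "Pr(>|t|)" holds the p-values
coefs <- summary(my_model)$coefficients
coefs[, "Pr(>|t|)"]
```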
We will then rewrite the model including only attitude as the explanatory variable:
my_model2 <- lm(points ~ attitude, data = learning2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
So our final model now looks like this:
where y = points and x = attitude:
\[y = 11.6 + 3.5x\]
\[y = A + Bx\]
A is the point where the regression line intercepts the y axis, and B is the slope of the regression, showing the relationship between x and y. So in this case, when attitude increases by one unit, the points increase by 3.5 units. Based on what we explained previously, there is statistical significance to say that attitude is related to points.
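The estimated A and B can be read from the model object and used for prediction, for example for a hypothetical student with attitude 4 (a sketch using my_model2 from above):

```r
coef(my_model2)  # A (intercept) and B (slope of attitude)
# Predicted exam points for a hypothetical attitude value of 4
predict(my_model2, newdata = data.frame(attitude = 4))
```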
The multiple R squared of the model measures the amount of variation in the target variable that can be explained by the explanatory variables. In this case, the variable attitude explains 19% of the variation in exam points.
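The R squared value can be read straight off the model summary (a sketch):

```r
# Proportion of the variation in points explained by attitude (about 0.19)
summary(my_model2)$r.squared
```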
The model assumes that the residuals (the differences between the predicted and the real values) have a constant variance, so that they do not depend on the explanatory variable. In the following plot we can see that the points are quite randomly distributed and do not follow any pattern, so we can say that the assumption is reasonable.
plot(my_model2, which = 1)
The model also assumes that the residuals are normally distributed. With the following plot we can check whether that assumption holds. In this case, the points fall quite well onto the line for almost all values, so we can say that the normality assumption is reasonable.
plot(my_model2, which = 2)
Finally, in this plot we can identify whether there are single observations that have an unusually high impact on the model. If a single observation differs greatly from the rest, we say that it has high leverage. In this case, we can see that the leverage of the points is similar and no single point has much higher leverage than the others. Therefore we have regular leverage, which gives more validity to our model.
plot(my_model2, which = 5)
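The three diagnostic plots above can also be drawn side by side in a single figure (a sketch):

```r
# Residuals vs Fitted (1), Normal Q-Q (2) and Residuals vs Leverage (5)
par(mfrow = c(1, 3))
plot(my_model2, which = c(1, 2, 5))
par(mfrow = c(1, 1))  # restore the default layout
```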
In this chapter we will analyse the relationships between the target variable, high/low alcohol consumption, and other variables. Because the target variable is binary (2 values, TRUE or FALSE), we will use logistic regression to do so.
We obtained the original data from the UCI Machine Learning Repository. It is a combination of 2 data sets from the Student Performance Data Set, covering student performance in mathematics and in Portuguese. It contains 35 variables and 382 observations.
We have information about the students and their background, such as sex, age, family size, parents' education, parents' job, etc.
It also contains information about the students' grades; the most relevant is G3, the final grade.
Because in this exercise we are interested in studying high alcohol consumption among students, we created a variable called "alcohol use" (alc_use), which is the average of the weekday and weekend alcohol consumption.
Finally we created the variable that we will use as the target variable in this exercise, "high alcohol consumption" (high_use). This variable is TRUE when "alcohol use" is higher than 2 and FALSE otherwise.
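The derivation of these two variables during the data wrangling can be sketched roughly as follows (assuming a data frame alc that contains the weekday and weekend consumption columns Dalc and Walc from the original data):

```r
# alc_use: average of weekday (Dalc) and weekend (Walc) alcohol consumption
alc$alc_use <- (alc$Dalc + alc$Walc) / 2
# high_use: TRUE when the average consumption is higher than 2
alc$high_use <- alc$alc_use > 2
```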
For further information about the variables you can check: https://archive.ics.uci.edu/ml/datasets/Student+Performance
alc <- read.table("C:/Users/Elisabet/Desktop/ELI/MASTER'S DEGREE/Open Data Science/GitHub/IODS-project/IODS-project/alc.csv", header = TRUE, sep = ",")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F,...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T,...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, services...
## $ Fjob <fctr> teacher, other, other, services, other, other, oth...
## $ reason <fctr> course, course, other, home, home, reputation, hom...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mother, mo...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no, no,...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no, yes...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, no, n...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
In order to study the relationships between high alcohol consumption and some other variables, we will choose 4 variables which may have some kind of relationship with the target variable:
* Failures: number of past class failures. Numeric, from 0 to 4. We assume a positive correlation with the target variable: the higher the alcohol consumption, the more failures.
* Absences: number of school absences. Numeric, from 0 to 93. We also assume a positive correlation: higher alcohol consumption, more absences.
* Final grade (G3): numeric, from 0 to 20. We assume a negative correlation: higher alcohol consumption, lower grades.
* Romantic: whether the student is in a romantic relationship. Binary, yes or no. We assume that students in a romantic relationship are more focused on the relationship and therefore do not go out to drink with friends as much.
First we will look at the distribution of our chosen variables with bar plots.
We can see that the target variable, high_use, is distributed so that fewer than half of the students actually have high alcohol consumption rates. Absences are quite sparsely distributed, with many students with very few absences but also many with a high number of absences. The failures figures speak for themselves: the vast majority of students have not failed any class in the past. The grades are fairly widely distributed, but there is a higher number of students with lower grades. Finally, we can see that a bit less than half of the students are in a romantic relationship.
chosen_variables <- subset(alc, select = c("high_use", "failures", "absences", "G3", "romantic"))
library(tidyr)
library(ggplot2)
gather(chosen_variables) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped
With a bar plot of high_use faceted by romantic, we can see that, proportionally, students in a romantic relationship have a lower rate of high alcohol consumption.
g1 <- ggplot(alc, aes(x = high_use))
g1 + geom_bar() + facet_wrap("romantic")
Next we will draw some box plots with high_use on the x axis and grades, absences and failures on the y axis.
For the first one, high use vs grades, we also check the differences between males and females.
g1 <- ggplot(alc, aes(x= high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("Grades") + ggtitle("Grades by alcohol consumption and sex")
In the second one we can observe that those in a relationship actually have more absences but, more importantly, it supports our hypothesis that students with high alcohol consumption have more absences.
g2 <- ggplot(alc, aes(x= high_use, y = absences, col = romantic))
g2 + geom_boxplot() + ylab("Absences") + ggtitle("Absences by alcohol consumption and relationship")
This last graph is quite surprising: the only visible relationship is between being in a relationship and failures. We do not see the relationship between failures and high/low alcohol consumption that we hypothesized, most probably due to the fact that the failures variable is so heavily skewed towards 0 failures.
g3 <- ggplot(alc, aes(x= high_use, y = failures, col = romantic))
g3 + geom_boxplot() + ylab("Failures") + ggtitle("Failures by alcohol consumption and relationship")
We can also look at the differences numerically.
First, comparing the target variable with final grades, we see that students with higher alcohol consumption have lower mean final grades, which supports our hypothesis.
alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (G3))
## # A tibble: 2 x 3
## high_use count mean_grade
## <lgl> <int> <dbl>
## 1 FALSE 268 11.73507
## 2 TRUE 114 10.80702
We can also count how many of the students, in a relationship or not, have high/low alcohol consumption, and see the mean grades for each combination. The results support our initial hypothesis and confirm numerically what we saw in the previous graph: proportionally, students with a boyfriend/girlfriend have a lower rate of high alcohol consumption (27% = 33/(33+88)) than those without (31% = 81/(180+81)).
Again we can see that students with high alcohol consumption have lower grades. It is also curious to see that students in a relationship have lower grades as well, but that is not relevant to our study.
alc %>% group_by (romantic, high_use) %>% summarise (count = n(), mean_grade = mean (G3))
## # A tibble: 4 x 4
## # Groups: romantic [?]
## romantic high_use count mean_grade
## <fctr> <lgl> <int> <dbl>
## 1 no FALSE 180 12.00556
## 2 no TRUE 81 11.00000
## 3 yes FALSE 88 11.18182
## 4 yes TRUE 33 10.33333
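The proportions quoted above can also be obtained directly with a cross tabulation (a sketch using base R; it should reproduce the 27% vs 31% figures):

```r
# Share of high/low alcohol use within each romantic group (rows sum to 1)
prop.table(table(alc$romantic, alc$high_use), margin = 1)
```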
In the next table we can see that high alcohol consumption goes together with lower mean grades, more absences and more failures.
alc %>% group_by (high_use) %>% summarise (count = n(), mean_grade = mean (G3), mean_absences = mean(absences), mean_failures = mean(failures))
## # A tibble: 2 x 5
## high_use count mean_grade mean_absences mean_failures
## <lgl> <int> <dbl> <dbl> <dbl>
## 1 FALSE 268 11.73507 3.705224 0.1417910
## 2 TRUE 114 10.80702 6.368421 0.3421053
If we add the romantic variable to the previous table, we can observe numerically what we saw in the previous graphs. Students in a romantic relationship have slightly lower grades. Moreover, on the one hand, students with a girlfriend/boyfriend have more absences and failures when they have high alcohol use; on the other hand, when they have low alcohol use, they have more absences but fewer failures.
alc %>% group_by (high_use, romantic) %>% summarise (count = n(), mean_grade = mean (G3), mean_absences = mean(absences), mean_failures = mean(failures))
## # A tibble: 4 x 6
## # Groups: high_use [?]
## high_use romantic count mean_grade mean_absences mean_failures
## <lgl> <fctr> <int> <dbl> <dbl> <dbl>
## 1 FALSE no 180 12.00556 3.405556 0.1444444
## 2 FALSE yes 88 11.18182 4.318182 0.1363636
## 3 TRUE no 81 11.00000 5.777778 0.3086420
## 4 TRUE yes 33 10.33333 7.818182 0.4242424
After looking at the data and the possible relationships between the variables, we are now going to create a model with the chosen variables and see whether our hypotheses are statistically confirmed or rejected.
So our model would be:
\[ high\_use \sim romantic + grades + failures + absences \]
The results of the logistic regression clearly show that we did not choose the most appropriate variables, because what the previous exploration of the data suggested is not what we see after applying the regression model. In our model, the only statistically significant variables are absences and failures, leaving romantic and grades out. Most probably choosing romantic as an explanatory variable biased the whole model. So, if required, it would be adequate to run the regression again with different variables and check whether we could obtain a better model to explain high/low alcohol consumption.
m <- glm(high_use ~ romantic + absences + failures + G3, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ romantic + absences + failures + G3,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2106 -0.7928 -0.7038 1.1546 1.9020
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.67924 0.49112 -1.383 0.166652
## romanticyes -0.34058 0.25850 -1.318 0.187658
## absences 0.08623 0.02282 3.779 0.000157 ***
## failures 0.39863 0.20005 1.993 0.046301 *
## G3 -0.05083 0.03818 -1.332 0.183020
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 436.93 on 377 degrees of freedom
## AIC: 446.93
##
## Number of Fisher Scoring iterations: 4
Next we will interpret the coefficients of the model as odds ratios and provide confidence intervals for them.
The odds ratio quantifies the relationship between X and Y. In this case, Y is high alcohol use and the X's are the explanatory variables used in our model: romantic, absences, failures and grades.
For example, if we take romantic as X, the odds ratio measures the odds of having high alcohol consumption when the student is in a relationship divided by the odds of having high alcohol consumption when the student is single.
\[ OR = \frac{Odds(Y \mid X)}{Odds(Y \mid \text{without } X)} \]
In logistic regression, the computational target variable is the log of the odds. So if we apply the exponential function to the fitted values, we obtain the odds.
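As a sanity check, we can exponentiate the fitted log-odds into odds and convert them back to probabilities (a sketch using the model m fitted above):

```r
log_odds <- predict(m, type = "link")  # fitted values on the log-odds scale
odds <- exp(log_odds)                  # odds of high alcohol consumption
probs <- odds / (1 + odds)             # same as predict(m, type = "response")
all.equal(probs, predict(m, type = "response"))  # should be TRUE
```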
In the odds ratios that we have computed we can observe that students in a relationship are only 0.7 times as likely to be high alcohol consumers as those who are not. In other words, single students are 1.4 times more likely to be high alcohol consumers (to get the odds ratio for "romanticno", we just take the inverse: 0.71^-1 = 1.41).
For non-binary variables, we must interpret the odds ratio as the odds of being a high alcohol consumer after a one-unit change in variable X, divided by the odds of being a high alcohol consumer with no change in variable X:
\[ exp(coef) = Odds(Y | X+1)/Odds(Y | X) \]
Therefore, with the odds ratio of absences, we can interpret that a student with one more absence is 1.09 times as likely to be a high alcohol consumer as a student without that extra absence. Of course, with failures the possible values are more limited (only from 0 to 4), so the change is bigger: a student with one more failure is 1.5 times as likely to be a high alcohol consumer as a student without that extra failure.
Grades is the only variable with an OR (and CI) below 1, meaning that the correlation is negative. A student with a one-point-higher grade is about 5% less likely to be a high alcohol consumer; e.g. a student with grade 4 is 5% less likely to be a high alcohol consumer than one with grade 3.
The confidence intervals show the range within which the true odds ratios lie with 95% confidence. For the romantic variable the CI is very wide, which is why we did not find a statistically significant relationship. For both romantic and G3, the CI goes from below 1 to above 1, so it is unclear whether the correlation with the target variable is positive or negative; that is another reason why they cannot be significant.
Therefore we confirm our hypothesis that high alcohol consumption is positively related to absences and failures. But reject our hypothesis that high alcohol consumption is related anyhow to romantic relationships and grades.
# To compute the OR (odds ratios) and CI (confidence intervals):
OR <- coef(m) %>% exp()
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
# Print them out:
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.5070040 0.1908492 1.316612
## romanticyes 0.7113585 0.4240480 1.171181
## absences 1.0900618 1.0446648 1.142606
## failures 1.4897812 1.0079375 2.221726
## G3 0.9504370 0.8816401 1.024407
Now we will select only the significant variables (based on the previous logistic regression model) and create a new model to explore its predictive power.
First we write the new model, then we predict the probability of high use based on the response of our model. After that we add the probabilities and the predictions to our data set; a prediction is TRUE only when the probability is greater than 0.5. Finally we make a cross tabulation.
In the first table we can observe that of the high-use consumers, only 15 were predicted TRUE and 99 FALSE, so in 99 cases the model predicted that a student would not be a high alcohol consumer when in fact they were. We can also observe that in 10 cases the model predicted a student to be a high alcohol consumer when he/she was not. That already shows the weak predictive power of our model.
In the second table we can see the same in proportions: only 3.9% of all students are correctly predicted as high alcohol users, while 67.5% are correctly predicted as low alcohol users.
To sum up, our model makes the right prediction in 71.4% of the cases, which is reasonable but could be improved.
# We write the new model:
m2 <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ failures + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2680 -0.8019 -0.6967 1.2522 1.7906
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.37829 0.16479 -8.364 < 2e-16 ***
## failures 0.49972 0.18456 2.708 0.006776 **
## absences 0.08602 0.02287 3.762 0.000169 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 440.14 on 379 degrees of freedom
## AIC: 446.14
##
## Number of Fisher Scoring iterations: 4
# We predict the probability of high_use, based on the response of our model:
probabilities <- predict(m2, type = "response")
# We add the probabilities to our data set:
alc <- mutate(alc, probability = probabilities)
# Now we use these probabilities to make a prediction of high use (the prediction will only be true if probability is higher than 0.5):
alc <- mutate(alc, prediction = probability > 0.5)
# Let's tabulate the predictions versus the actual values:
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 258 10
## TRUE 99 15
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67539267 0.02617801 0.70157068
## TRUE 0.25916230 0.03926702 0.29842932
## Sum 0.93455497 0.06544503 1.00000000
We can also draw a plot to see the true values of high_use against the predictions:
# We can also see it graphically:
g <- ggplot(alc, aes(x=probability, y=high_use, col=prediction))
g + geom_point()
Finally, to measure the accuracy of our model, we will compute the total proportion of inaccurately classified individuals (= incorrect predictions in the training data). We define a loss function for this purpose and then compute the average number of wrong predictions in the training data. We obtained that 0.29, or 29%, of the predictions are wrong.
The performance of the model is reasonable, but one can guess that a better model for predicting high/low alcohol consumption could have been found. The simple guessing strategy gave us quite a good understanding, but we were still wrong about 2 of the 4 variables, so it is questionable.
loss_func <- function(class, prob){
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class=alc$high_use, prob=alc$probability)
## [1] 0.2853403
As we have seen, statistical models can be used to make predictions. Cross-validation aims to estimate how accurately a predictive model will perform in practice.
In cross-validation we calculate the same loss function as before, but on different data than that used for fitting the model (= the testing data). In other words, we calculate the average number of wrong predictions in the testing data.
# 10-fold cross-validation:
library(boot)
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m2, K=10)
# Average number of wrong predictions in the testing data with 10-fold cross-validation:
cv$delta[1]
## [1] 0.2879581
The average number of wrong predictions in cross-validation for my model is 0.29, which is higher than the error of the example model in DataCamp, 0.26. So my model has worse test set performance, as one could have expected at this point in the exercise.
We can fit different logistic regression models and compare their performance by looking at the training and testing errors. In the following models (model 3 to model 7) we started with 6 explanatory variables and in each subsequent model removed the least significant variable.
Looking at the errors, we cannot see any clear tendency as to whether having more or fewer variables improves the model performance.
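The comparison can be automated by looping over candidate formulas and computing the training and testing error for each (a sketch, assuming loss_func and the alc data from above; the cross-validation folds are random, so the testing errors vary between runs):

```r
library(boot)
candidate_formulas <- list(
  high_use ~ sex + failures + absences + G3 + romantic + goout,
  high_use ~ sex + failures + absences + romantic + goout,
  high_use ~ sex + failures + absences + goout
)
for (f in candidate_formulas) {
  fit <- glm(f, data = alc, family = "binomial")
  train_err <- loss_func(class = alc$high_use, prob = predict(fit, type = "response"))
  test_err  <- cv.glm(data = alc, cost = loss_func, glmfit = fit, K = 10)$delta[1]
  cat(length(attr(terms(f), "term.labels")), "predictors - training error:",
      round(train_err, 3), "testing error:", round(test_err, 3), "\n")
}
```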
Model 3 (6 predictors):
# Model 3, 6 predictors:
m3 <- glm(high_use ~ sex + failures + absences + G3 + romantic + goout, data = alc, family = "binomial")
summary(m3)
##
## Call:
## glm(formula = high_use ~ sex + failures + absences + G3 + romantic +
## goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1319 -0.7948 -0.5292 0.8075 2.4357
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.66129 0.72329 -5.062 4.15e-07 ***
## sexM 0.90410 0.25705 3.517 0.000436 ***
## failures 0.29568 0.22493 1.315 0.188669
## absences 0.08370 0.02243 3.732 0.000190 ***
## G3 -0.02973 0.04204 -0.707 0.479487
## romanticyes -0.30928 0.27986 -1.105 0.269103
## goout 0.69962 0.12050 5.806 6.40e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 383.45 on 375 degrees of freedom
## AIC: 397.45
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities3 <- predict(m3, type = "response")
alc <- mutate(alc, probability3 = probabilities3)
alc <- mutate(alc, prediction3 = probability3 > 0.5)
#Training error m3:
loss_func(class=alc$high_use, prob=alc$probability3)
## [1] 0.2015707
#Testing error m3:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m3, K=10)
cv$delta[1]
## [1] 0.2094241
Model 4 (5 predictors):
# Model 4, 5 predictors:
m4 <- glm(high_use ~ sex + failures + absences + romantic + goout, data = alc, family = "binomial")
summary(m4)
##
## Call:
## glm(formula = high_use ~ sex + failures + absences + romantic +
## goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0766 -0.7901 -0.5395 0.7706 2.4403
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.05277 0.47663 -8.503 < 2e-16 ***
## sexM 0.90227 0.25695 3.511 0.000446 ***
## failures 0.35231 0.21070 1.672 0.094508 .
## absences 0.08466 0.02244 3.773 0.000161 ***
## romanticyes -0.29147 0.27850 -1.047 0.295294
## goout 0.70951 0.11992 5.916 3.29e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 383.95 on 376 degrees of freedom
## AIC: 395.95
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities4 <- predict(m4, type = "response")
alc <- mutate(alc, probability4 = probabilities4)
alc <- mutate(alc, prediction4 = probability4 > 0.5)
#Training error m4:
loss_func(class=alc$high_use, prob=alc$probability4)
## [1] 0.2172775
#Testing error m4:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m4, K=10)
cv$delta[1]
## [1] 0.2277487
Model 5 (4 predictors):
# Model 5, 4 predictors:
m5 <- glm(high_use ~ sex + failures + absences + goout, data = alc, family = "binomial")
summary(m5)
##
## Call:
## glm(formula = high_use ~ sex + failures + absences + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0299 -0.7890 -0.5417 0.7857 2.3588
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.13390 0.47292 -8.741 < 2e-16 ***
## sexM 0.91999 0.25625 3.590 0.000330 ***
## failures 0.34560 0.21093 1.638 0.101330
## absences 0.08243 0.02235 3.688 0.000226 ***
## goout 0.70797 0.11998 5.901 3.62e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 385.06 on 377 degrees of freedom
## AIC: 395.06
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities5 <- predict(m5, type = "response")
alc <- mutate(alc, probability5 = probabilities5)
alc <- mutate(alc, prediction5 = probability5 > 0.5)
#Training error m5:
loss_func(class=alc$high_use, prob=alc$probability5)
## [1] 0.2146597
#Testing error m5:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m5, K=10)
cv$delta[1]
## [1] 0.2225131
Model 6 (3 predictors):
# Model 6, 3 predictors:
m6 <- glm(high_use ~ sex + absences + goout, data = alc, family = "binomial")
summary(m6)
##
## Call:
## glm(formula = high_use ~ sex + absences + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.8153 -0.5446 0.8357 2.4740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16317 0.47506 -8.764 < 2e-16 ***
## sexM 0.95872 0.25459 3.766 0.000166 ***
## absences 0.08418 0.02237 3.764 0.000167 ***
## goout 0.72981 0.11970 6.097 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.75 on 378 degrees of freedom
## AIC: 395.75
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities6 <- predict(m6, type = "response")
alc <- mutate(alc, probability6 = probabilities6)
alc <- mutate(alc, prediction6 = probability6 > 0.5)
#Training error m6:
loss_func(class=alc$high_use, prob=alc$probability6)
## [1] 0.2094241
#Testing error m6:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m6, K=10)
cv$delta[1]
## [1] 0.2146597
Model 7 (2 predictors):
# Model 7, 2 predictors:
m7 <- glm(high_use ~ sex + goout, data = alc, family = "binomial")
summary(m7)
##
## Call:
## glm(formula = high_use ~ sex + goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5599 -0.8726 -0.6260 0.8382 2.4893
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8130 0.4529 -8.420 < 2e-16 ***
## sexM 0.8736 0.2466 3.543 0.000396 ***
## goout 0.7609 0.1181 6.441 1.19e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 402.74 on 379 degrees of freedom
## AIC: 408.74
##
## Number of Fisher Scoring iterations: 4
# Adding probability and prediction columns to our data
probabilities7 <- predict(m7, type = "response")
alc <- mutate(alc, probability7 = probabilities7)
alc <- mutate(alc, prediction7 = probability7 > 0.5)
#Training error m7:
loss_func(class=alc$high_use, prob=alc$probability7)
## [1] 0.2094241
#Testing error m7:
cv <- cv.glm(data=alc, cost=loss_func, glmfit = m7, K=10)
cv$delta[1]
## [1] 0.2356021
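The repeated fit/predict/cross-validate steps above could also be collapsed into a loop. The sketch below is a self-contained illustration on simulated data; with the real data you would replace `dat` with `alc` and the formulas with those of models 3-7 (the variable names `x1`, `x2`, `x3`, `y` are only stand-ins).

```r
# Sketch of the model 3-7 comparison as a loop (simulated data)
library(boot)  # for cv.glm

set.seed(2017)
n <- 382
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-1 + 0.8 * dat$x1 + 0.5 * dat$x2))

# mean prediction error, as in the loss_func used above
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# one formula per candidate model, largest first
formulas <- list(y ~ x1 + x2 + x3, y ~ x1 + x2, y ~ x1)

errors <- sapply(formulas, function(f) {
  m <- glm(f, data = dat, family = "binomial")
  c(training = loss_func(dat$y, predict(m, type = "response")),
    testing  = cv.glm(data = dat, cost = loss_func, glmfit = m, K = 10)$delta[1])
})
round(errors, 3)
```

This way, adding or removing a candidate model only means editing the `formulas` list instead of copy-pasting a whole code block.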
In this chapter we will learn how to classify data using linear discriminant analysis and how to cluster it with k-means, one of the best-known clustering methods.
We will use the dataset called Boston from the MASS package. The data has 506 observations of 14 variables. The dataset describes housing values in the suburbs of Boston, e.g. per capita crime rate by town, proportion of non-retail business acres per town, nitrogen oxides concentration, etc. More information about the data can be found here: https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/Boston.html
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
Next we can see a graphical overview of the data and the summaries of the variables.
Because we have so many variables, it would be tedious to draw a scatter plot for each pair of variables; an easier way to see the relationships between the variables is to make a correlation matrix.
In our matrix we can see visually which variables are most strongly related and whether the relation is positive or negative (from the colours and sizes of the circles), and we can also read the correlations numerically.
The highest correlation is between rad and tax, i.e. the index of accessibility to radial highways and the full-value property-tax rate, so the tax rate appears to depend strongly on whether the property is easily accessible by radial highways or not. The plot shows that many other pairs of variables are also well correlated.
install.packages("corrplot")
install.packages("tidyverse")
library(corrplot)
## corrplot 0.84 loaded
library(tidyverse)
## -- Attaching packages ---------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.3.4 v purrr 0.2.4
## v readr 1.1.1 v stringr 1.2.0
## v tibble 1.3.4 v forcats 0.2.0
## -- Conflicts ------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
cor_matrix <- cor(Boston) %>% round(digits=2)
corrplot.mixed(cor_matrix, lower.col = "black", number.cex = .6)
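As a sanity check, cor() computes the Pearson correlation by default, i.e. the covariance divided by the product of the standard deviations. A quick manual check for the rad-tax pair:

```r
# Pearson correlation of rad and tax, computed manually and with cor()
library(MASS)
x <- Boston$rad
y <- Boston$tax
manual <- sum((x - mean(x)) * (y - mean(y))) / ((length(x) - 1) * sd(x) * sd(y))
all.equal(manual, cor(x, y))  # TRUE
```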
We can see a summary of the variables in the following tables. For example, for the variable crim we can see that the mean per capita crime rate is 3.6 and the maximum almost 89, which looks quite alarming. The mean concentration of nitrogen oxides is 0.55 parts per 10 million.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Finally, the following graphs show visually how the variables are distributed.
gather(Boston) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Linear discriminant analysis assumes that the variables are normally distributed and that each variable has the same variance, which is why it is better to scale the data before fitting our model.
Below are the summaries of the scaled data. Note how the values have changed compared to the previous summary: the mean of each variable is now 0, which indicates that all the variables are on the same scale.
After scaling, we convert the object to a data frame so that we can use it as data later.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
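Incidentally, scale() performs exactly column-wise standardization: it subtracts each column's mean and divides by its standard deviation, which is why all the means end up at 0. A quick check on the crim column:

```r
# scale() standardizes each column: (x - mean(x)) / sd(x)
library(MASS)
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(scale(Boston)[, "crim"]), manual)  # TRUE
```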
First we will create a categorical variable from the crime rate in the Boston dataset, using the quantiles as the break points of the categories.
# First we create the quantile vector of crim:
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# Then we create the categorical variable "crime":
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
## low med_low med_high high
## 126 126 126 127
Then we will remove the old variable “crim” and add the new one “crime” to the dataset.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
Next, we divide the dataset into train and test sets, so that 80% of the data belongs to the train set. This will allow us to make predictions later in this chapter.
n <- nrow(boston_scaled)
# Choose randomly 80% of the rows:
ind <- sample(n, size = n*0.8)
# create the train set:
train <- boston_scaled[ind,]
# create the test set:
test <- boston_scaled[-ind,]
Now we fit the linear discriminant analysis (LDA) on the train dataset, with crime as the target variable and all the other variables as predictors.
lda.fit <- lda(crime ~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2450495 0.2549505 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 0.9452272 -0.9052783 -0.08304540 -0.8554766 0.5046246
## med_low -0.1266824 -0.2972299 -0.11325431 -0.5507297 -0.1382393
## med_high -0.3773415 0.1436590 0.18636222 0.3259555 0.1021377
## high -0.4872402 1.0149946 -0.07145661 1.0574032 -0.4924035
## age dis rad tax ptratio
## low -0.8233322 0.8164615 -0.6859201 -0.7670454 -0.41693843
## med_low -0.3397576 0.3150936 -0.5503260 -0.4842909 -0.09517036
## med_high 0.4075369 -0.3436748 -0.4143278 -0.3238275 -0.22790304
## high 0.8379864 -0.8785295 1.6596029 1.5294129 0.80577843
## black lstat medv
## low 0.3733708 -0.779213200 0.56946034
## med_low 0.3173490 -0.141298718 0.01389565
## med_high 0.1007325 -0.005105865 0.15699428
## high -0.7996197 0.923388990 -0.70798130
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.060298822 0.820330637 -0.95278093
## indus -0.002961352 -0.261213900 0.25986244
## chas -0.092402296 -0.060739189 -0.06900523
## nox 0.283864177 -0.723227414 -1.42891605
## rm -0.116777335 -0.096898325 -0.21612850
## age 0.245778061 -0.258995405 -0.25985656
## dis -0.102969111 -0.332439809 0.07268566
## rad 3.364689351 0.998666886 0.01791423
## tax 0.034356711 -0.108689899 0.62462892
## ptratio 0.134780360 0.076461092 -0.39453534
## black -0.153769976 0.005758998 0.14982210
## lstat 0.207263137 -0.323658718 0.29741795
## medv 0.164342241 -0.382332781 -0.17958713
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9534 0.0342 0.0124
Finally we draw the LDA biplot to visualize how the data is distributed. Each class has a different colour and the arrows represent the predictor variables; their length and direction are based on the coefficients and show their impact on the model.
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch= classes)
lda.arrows(lda.fit, myscale = 1)
First we will save the crime categories from the test set and then remove the crime variable from the test dataset.
# we save the correct classes from test data:
correct_classes <- test$crime
# remove the crime variable from the test data:
test <- dplyr::select(test, -crime)
Based on the trained model, the LDA calculates the probability of each new observation belonging to each of the classes and assigns it to the class with the highest probability.
We predict the classes with LDA model on the test data.
lda.pred <- predict(lda.fit, newdata = test)
Finally we cross-tabulate our predictions against the crime categories from the test set that we saved above. Most of the predictions are correct: only 24 of them are wrong. Looking at the proportions, 76.2% of the predictions are correct.
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 13 9 0 0
## med_low 5 17 5 0
## med_high 0 3 19 1
## high 0 0 1 28
table(correct = correct_classes, predicted = lda.pred$class) %>% prop.table() %>% addmargins()
## predicted
## correct low med_low med_high high Sum
## low 0.12871287 0.08910891 0.00000000 0.00000000 0.21782178
## med_low 0.04950495 0.16831683 0.04950495 0.00000000 0.26732673
## med_high 0.00000000 0.02970297 0.18811881 0.00990099 0.22772277
## high 0.00000000 0.00000000 0.00990099 0.27722772 0.28712871
## Sum 0.17821782 0.28712871 0.24752475 0.28712871 1.00000000
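The "highest posterior probability" rule can be verified directly: predict.lda returns a posterior matrix, and the predicted class is always the column with the largest posterior. A self-contained check with the iris data (the same idea applies to our crime classes):

```r
# Verify that predict.lda picks the class with the highest posterior
library(MASS)
fit <- lda(Species ~ ., data = iris)
pred <- predict(fit, newdata = iris)

# predicted class = column of the posterior matrix with the largest value
argmax <- colnames(pred$posterior)[max.col(pred$posterior, ties.method = "first")]
all(argmax == as.character(pred$class))  # TRUE

# accuracy from the cross-tabulation, computed as above
tab <- table(correct = iris$Species, predicted = pred$class)
sum(diag(tab)) / sum(tab)
```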
First we scale the dataset Boston.
boston_scaled2 <- scale(Boston)
summary(boston_scaled2)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# change the object to data frame
boston_scaled2 <- as.data.frame(boston_scaled2)
We calculate the distances between the observations using the most common measure, the euclidean distance.
dist_eu <- dist(boston_scaled2)
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
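dist() with its default method returns the euclidean distance, sqrt(sum((a - b)^2)), for every pair of rows. A two-point check:

```r
# dist() default = euclidean distance between rows
a <- c(0, 0)
b <- c(3, 4)
m <- rbind(a, b)
as.numeric(dist(m))   # 5
sqrt(sum((a - b)^2))  # 5
```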
Then we run the k-means algorithm on the dataset. First we pick an arbitrary number of clusters (initial cluster centers).
km <- kmeans(boston_scaled2, centers = 3)
Now we will investigate the optimal number of clusters. For this purpose we first set the maximum number of clusters to 10. Then we use one of the most common methods for deciding the number of clusters, the total within-cluster sum of squares (TWCSS), i.e. the sum of the within-cluster sums of squares (WCSS) over all clusters. When we plot the number of clusters against the TWCSS, the optimal number of clusters is where the TWCSS drops radically.
# avoid the kmeans to give us every time different results
set.seed(123)
# set the maximum number of clusters to 10
k_max <- 10
# calculate the total sum of squares:
twcss <- sapply(1:k_max, function(k){kmeans (boston_scaled2, k)$tot.withinss})
# see graphically the optimal number of clusters:
qplot(x = 1:k_max, y = twcss, geom = 'line')
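The TWCSS that kmeans() reports (tot.withinss) is literally the sum of the per-cluster WCSS values (withinss), which can be checked directly:

```r
# TWCSS = sum of the per-cluster WCSS values
library(MASS)
set.seed(123)
km_check <- kmeans(scale(Boston), centers = 3)
all.equal(km_check$tot.withinss, sum(km_check$withinss))  # TRUE
```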
From the above graph we can see that the optimal number of clusters is 2, so we run the k-means algorithm again and visualize the results. First we see a matrix with all the variables, then zoomed in on variables 1 to 5 and 6 to 10. In these graphics we can observe the two clusters in the data, in red and black, and how they relate for each pair of variables.
km <- kmeans(boston_scaled2, centers = 2)
pairs(boston_scaled2, col=km$cluster)
pairs(boston_scaled2[1:5], col=km$cluster)
pairs(boston_scaled2[6:10], col=km$cluster)
First we scale the Boston dataset.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
boston_scaled3 <- scale(Boston)
summary(boston_scaled3)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Then we perform k-means.
km2 <- kmeans(boston_scaled3, centers = 3)
boston_scaled3 <- as.data.frame(boston_scaled3)
Then we perform LDA using the clusters as target classes.
lda.fit2 <- lda(km2$cluster ~., data=boston_scaled3)
We visualize the results. We can observe that the most influential variables separating the clusters are accessibility to radial highways (rad), nitrogen oxides concentration (nox) and proportion of residential land zoned for lots over 25,000 sq. ft. (zn).
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(km2$cluster)
# plot the lda results
plot(lda.fit2, dimen = 2, col=classes, pch= classes)
lda.arrows(lda.fit2, myscale = 1)
First we run the code below on the scaled train data that we used to fit the LDA. The code creates a matrix product, which is a projection of the data points.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
Then we draw a plot.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
Then we draw the same plot but assigning the colors to be the crime classes of the train set.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
Finally, we draw the same plot but now setting the colors to be the clusters of the k-means.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km$cluster)
In the second graphic we can identify which observations belong to which crime classes, and we can observe some clustering within the classes: for the class high there is a clear cluster, while the rest are not as clear.
In the last graphic each colour should correspond to a k-means cluster, and, as we could have guessed, the optimal number is only 2 clusters: one for high and another for the rest of the crime classes. For some reason we could not draw this graphic.
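A plausible cause for the failed plot (an assumption, since the error message is not shown): km$cluster has one label per row of the full Boston data (506 values), while matrix_product only contains the 404 train rows, so the colour vector has the wrong length. Subsetting the cluster labels with the same train indices would align them. A stand-in demonstration:

```r
# Stand-in demonstration of the length mismatch and its fix
n <- 506
set.seed(1)
ind <- sample(n, size = n * 0.8)               # same kind of 80% train indices
cluster_all <- sample(1:2, n, replace = TRUE)  # stand-in for km$cluster (506 labels)
cluster_train <- cluster_all[ind]              # aligned with the 404 train rows
length(cluster_train)  # 404
# plot_ly(..., color = factor(cluster_train)) would then match matrix_product
```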